Dedup translog operations by reading in reverse #27268
Conversation
Currently, translog operations are read and processed one by one. This may be a problem as stale operations in translogs may suddenly reappear in recoveries. To make sure that stale operations won't be processed, we read the translog files in reverse order (i.e. from the most recent file to the oldest file) and only process an operation if its sequence number was not seen before.
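The dedup idea can be sketched with plain collections. This is an illustrative sketch, not the PR's code: `TranslogDedupSketch`, `Op`, and `dedup` are hypothetical names, and a `HashSet` stands in for the PR's seen-sequence-number set. Reading generations newest-first and skipping already-seen sequence numbers keeps only the most recent copy of each operation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TranslogDedupSketch {
    // A translog operation: sequence number plus an opaque payload.
    record Op(long seqNo, String payload) {}

    // Read generations from NEWEST to OLDEST; an operation is emitted only if
    // its sequence number has not been seen in a newer generation. Negative
    // sequence numbers (unassigned) are never deduplicated.
    static List<Op> dedup(List<List<Op>> generationsNewestFirst) {
        Set<Long> seenSeqNo = new HashSet<>();
        List<Op> result = new ArrayList<>();
        for (List<Op> generation : generationsNewestFirst) {
            for (Op op : generation) {
                if (op.seqNo() < 0 || seenSeqNo.add(op.seqNo())) {
                    result.add(op);
                }
            }
        }
        return result;
    }
}
```

A stale copy of an operation in an older generation is dropped because its sequence number was already claimed by the newer generation.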
I like this. I left some minor comments. I would also love to see a replication group test that creates collisions between old primary operations and new primary operations with the same sequence number, then performs recovery and checks they don't re-appear. You can look at RecoveryTests to see how to do these things. Another example is IndexLevelReplicationTests#testConflictingOpsOnReplica. I'm thinking something like:
- 3 shard group with some initial data.
- Remove a replica from the group and index an operation.
- Fail the primary and add the removed replica back into the group. Promote the replica just added to a primary.
- Index one more op to the group. This will create a collision.
- Make sure that peer recovery from the 3rd replica (the one that was never removed) works. Note that you'll need to make that shard the primary.
PS: we have a bug in how replicas deal with the translog when they detect that there's a new primary when an operation acquires a replica lock (as opposed to the cluster state coming in), causing duplicates to be in the same translog gen. @jasontedor is working on a fix but it will be good to see that your test runs into that bug without his fix.
Feel free to reach out. The test I suggested can be complex to set up.
Translog.Operation op;
while ((op = current.next()) != null) {
    if (op.seqNo() < 0 || seenSeqNo.getAndSet(op.seqNo()) == false) {
        return op;
can we check that the seqNo() is not set (UNASSIGNED_SEQ_NO) instead of a generic <0?
Done.
 */
static final class SeqNumSet {
    static final short BIT_SET_SIZE = 1024;
    private final LongSet topTier = new LongHashSet();
these are not always the top tier - can we instead call it something like completedBitSets?
I was implementing a multi-level bitset structure: when a bitset at a lower level is completed, we move it to the higher level as a single bit in another bitset, and so on. However, I preferred a simpler solution. These names are derived from that structure but are no longer valid. I pushed 5935009.
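The two-tier structure these names come from can be sketched with plain `java.util` types. This is a hedged sketch, not the PR's implementation: `SeqNoSetSketch` is a hypothetical name, `java.util.BitSet` stands in for `CountedBitSet`, and plain `HashSet`/`HashMap` stand in for the hppc `LongHashSet`/`LongObjectHashMap`; it assumes non-negative sequence numbers.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sequence numbers are partitioned into fixed-size windows; once every bit
// in a window is set, the whole window collapses into a single entry in
// `completed`, which is where the memory saving comes from.
final class SeqNoSetSketch {
    static final int BIT_SET_SIZE = 1024;
    private final Set<Long> completed = new HashSet<>();           // fully-seen windows
    private final Map<Long, BitSet> ongoing = new HashMap<>();     // partially-seen windows

    /** Marks seqNo as seen; returns the previous value of its bit. */
    boolean getAndSet(long seqNo) {
        long key = seqNo / BIT_SET_SIZE;
        if (completed.contains(key)) {
            return true; // whole window already seen
        }
        BitSet bits = ongoing.computeIfAbsent(key, k -> new BitSet(BIT_SET_SIZE));
        int bit = (int) (seqNo % BIT_SET_SIZE);
        boolean previous = bits.get(bit);
        bits.set(bit);
        if (bits.cardinality() == BIT_SET_SIZE) {
            ongoing.remove(key);   // collapse a full window ...
            completed.add(key);    // ... into one long in the upper tier
        }
        return previous;
    }
}
```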
static final class SeqNumSet {
    static final short BIT_SET_SIZE = 1024;
    private final LongSet topTier = new LongHashSet();
    private final LongObjectHashMap<CountedBitSet> bottomTier = new LongObjectHashMap<>();
ongoingSets?
incompleteBitSets?
Done
 * The number of operations that have been skipped in the snapshot so far.
 * Unlike {@link #totalOperations()}, this value is updated each time {@link #next()} is called.
 */
default int skippedOperations() {
I personally find skippedOperations a bit tricky, as we don't really skip operations. Rather, we remove operations that have been overridden. Maybe "overriddenOperations", and update the javadocs to explain what overridden means?
I pushed 5cc33bb
});
}
public void testTrackSeqNumDenseRanges() throws Exception { |
What does this test add compared to testTrackSeqNumRandomRanges? I think we can drop testTrackSeqNumRandomRanges and keep this one. It's better to have more collisions in testing.
public void testTrackSeqNumDenseRanges() throws Exception {
    final MultiSnapshot.SeqNumSet bitSet = new MultiSnapshot.SeqNumSet();
    final LongSet normalSet = new LongHashSet();
    IntStream.range(0, between(20_000, 50_000)).forEach(i -> {
I think we can run up to 10K all the time? It's enough to show both collisions and non-collisions.
long currentSeq = between(10_000_000, 1_000_000_000);
final int iterations = between(100, 2000);
for (long i = 0; i < iterations; i++) {
    List<Long> batch = LongStream.range(currentSeq, currentSeq + between(1, 1000))
Can we make sure batches sometimes cross the underlying 1024-bit bitset boundary, to check that we handle completed sets correctly? Maybe we can also expose the number of completed sets in SeqNumSet and test that we see the right number.
👍 I'd also quite like to see some non-random tests that deliberately hit the corners, rather than relying on the random ones finding everything, and I also think it'd be useful to assert that there are the expected number of complete/incomplete sets.
I pushed ea11a2a
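A deterministic boundary test along these lines might look like the sketch below. `SeqNoWindowSketch` is a hypothetical stand-in for the PR's SeqNumSet (using plain `java.util` types instead of hppc and CountedBitSet); the point is a batch that deliberately crosses the 1024-bit window boundary, plus assertions on the number of completed and ongoing sets.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Minimal window-based seen-set that exposes its completed/ongoing counts so
// a deterministic test can assert the bookkeeping at the window boundary.
final class SeqNoWindowSketch {
    static final int WINDOW = 1024;
    private final Set<Long> completed = new HashSet<>();
    private final Map<Long, BitSet> ongoing = new HashMap<>();

    boolean getAndSet(long seqNo) {
        long key = seqNo / WINDOW;
        if (completed.contains(key)) {
            return true;
        }
        BitSet bits = ongoing.computeIfAbsent(key, k -> new BitSet(WINDOW));
        int bit = (int) (seqNo % WINDOW);
        boolean previous = bits.get(bit);
        bits.set(bit);
        if (bits.cardinality() == WINDOW) {
            ongoing.remove(key);
            completed.add(key);
        }
        return previous;
    }

    int completedSets() { return completed.size(); }
    int ongoingSets()   { return ongoing.size(); }
}
```

A batch covering 1000..1099 touches windows 0 and 1 without completing either; filling 0..1023 then collapses window 0 into the completed tier.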
final MultiSnapshot.SeqNumSet bitSet = new MultiSnapshot.SeqNumSet();
final LongSet normalSet = new LongHashSet();
long currentSeq = between(10_000_000, 1_000_000_000);
final int iterations = between(100, 2000);
Maybe use scaledRandomIntBetween? This means faster runs in IntelliJ and better coverage in CI.
It's good to know about scaledRandomIntBetween. Thank you.
Scratch this PS. Jason looked more into this and concluded that the missing roll generation when processing a new term from the cluster state can't cause duplicates after all (I know this is vague - I'll explain once we're both online).
/**
 * Sequence numbers from translog are likely to form contiguous ranges, thus using two tiers can reduce memory usage.
 */
static final class SeqNumSet {
Could we call this SequenceNumberSet instead? Also, this idea feels more widely applicable. Can it be used elsewhere?
This should be used by MultiSnapshot only.
}

/**
 * Sequence numbers from translog are likely to form contiguous ranges, thus using two tiers can reduce memory usage.
I'm not familiar with the scale of the numbers here, but I take it that memory usage is an important constraint? Without looking at how the sets all work in great detail, I guess this takes ~10x less memory - does that sound about right? Is that enough? Could you expand the comment to answer some of these questions?
I have adjusted the comment in 5935009, but did not go into such detail.
@Override
public void describeTo(Description description) {
Should this be empty?
I pushed ebe92b6. Thank you.
while ((op = current.next()) != null) {
    if (op.seqNo() < 0 || seenSeqNo.getAndSet(op.seqNo()) == false) {
        return op;
    }else {
Nit: spacing :)
Done.
# Conflicts:
#   core/src/test/java/org/elasticsearch/index/translog/TranslogTests.java
@bleskes, @DaveCTurner I have added a replication test and addressed your feedback. Could you please take another look? Thank you!
import static org.hamcrest.CoreMatchers.equalTo;
import static org.hamcrest.Matchers.lessThanOrEqualTo;

public class MultiSnapshotTests extends ESTestCase {
I'd still like to see a deterministic (i.e. no randomness) test that exercises the corners of SeqNumSet.
I've added 7d3482b
 * Sequence numbers from translog are likely to form contiguous ranges,
 * thus collapsing a completed bitset into a single entry will reduce memory usage.
 */
static final class SeqNumSet {
Could we call this SequenceNumberSet instead?
This class is used only here; a concise name in a private context is better. Anyway, I renamed it to SeqNoSet in a3994f8, as we call it SeqNo*, not SeqNum*.
Looks OK to me - I added a couple more comments as I think they were lost in your changes last time round. I don't feel qualified to give a full LGTM on this as I'm not familiar enough with this area yet. @ywelsch can you?
@jasontedor, @bleskes With this PR, we will read translog operations in reverse; however, this may cause LocalCheckpointTracker to consume more memory than before when executing a peer recovery. Is it ok if I address this later in a follow-up PR? The relevant code is elasticsearch/core/src/main/java/org/elasticsearch/indices/recovery/RecoverySourceHandler.java, lines 246 to 255 in 90d6317.
We can indeed optimize as a follow-up. I do wonder if we should keep things simpler and use CountedBitSet until it's full, then free the underlying FixedBitSet and treat this case as "all set". WDYT? Maybe this is good enough for this PR too?
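The "free the bitset once full" idea can be sketched as follows. This is a hedged sketch of the suggestion, not the eventual implementation: `ReleasingBitSetSketch` is a hypothetical name, and `java.util.BitSet` stands in for Lucene's `FixedBitSet`. Once every bit is set, the internal bitset is dropped and the wrapper simply answers true for every query.

```java
import java.util.BitSet;

// A CountedBitSet-like wrapper: tracks how many bits are set and, once all
// bits are set, releases the internal bitset, after which it behaves as if
// every bit were set ("all set") without holding any per-bit memory.
final class ReleasingBitSetSketch {
    private final int size;
    private int count;   // number of distinct bits currently set
    private BitSet bits; // null once the set is full

    ReleasingBitSetSketch(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    void set(int index) {
        if (bits == null) {
            return; // already full; nothing left to record
        }
        if (bits.get(index) == false) {
            bits.set(index);
            count++;
            if (count == size) {
                bits = null; // free the memory; treat as "all set" from now on
            }
        }
    }

    boolean get(int index) {
        return bits == null || bits.get(index);
    }
}
```

This keeps a single kind of set around instead of maintaining separate ongoing and completed collections.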
Thanks @bleskes and @DaveCTurner. I will add a follow-up to simplify the SeqNoSet.
Currently, translog operations are read and processed one by one. This may be a problem as stale operations in translogs may suddenly reappear in recoveries. To make sure that stale operations won't be processed, we read the translog files in reverse order (i.e. from the most recent file to the oldest file) and only process an operation if its sequence number was not seen before. Relates to #10708
* master:
  - Skip shard refreshes if shard is `search idle` (#27500)
  - Remove workaround in translog rest test (#27530)
  - inner_hits: Return an empty _source for nested inner hit when filtering on a field that doesn't exist.
  - percolator: Avoid TooManyClauses exception if number of terms / ranges is exactly equal to 1024
  - Dedup translog operations by reading in reverse (#27268)
  - Ensure logging is configured for CLI commands
  - Ensure `doc_stats` are changing even if refresh is disabled (#27505)
  - Fix classes that can exit
  - Revert "Adjust CombinedDeletionPolicy for multiple commits (#27456)"
  - Transpose expected and actual, and remove duplicate info from message. (#27515)
  - [DOCS] Fixed broken link in breaking changes
* 6.x:
  - [DOCS] s/Spitting/Splitting in split index docs
  - inner_hits: Return an empty _source for nested inner hit when filtering on a field that doesn't exist.
  - percolator: Avoid TooManyClauses exception if number of terms / ranges is exactly equal to 1024
  - Dedup translog operations by reading in reverse (#27268)
  - Ensure logging is configured for CLI commands
  - Ensure `doc_stats` are changing even if refresh is disabled (#27505)
  - Fix classes that can exit
  - Revert "Adjust CombinedDeletionPolicy for multiple commits (#27456)"
  - Transpose expected and actual, and remove duplicate info from message.
Today, we maintain two sets in a SeqNoSet: ongoing sets and completed sets. We can remove the completed sets by releasing the internal bitset of a CountedBitSet when all its bits are set. Relates elastic#27268
Today, we maintain two sets in a SeqNoSet: ongoing sets and completed sets. We can remove the completed sets and use only the ongoing sets by releasing the internal bitset of a CountedBitSet when all its bits are set. This behaves like two sets but simpler. This commit also makes CountedBitSet as a drop-in replacement for BitSet. Relates #27268